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Abstract 

The ability to rationally modify targeted physical and biological features of a protein of interest holds promise in numerous 
academic and industrial applications and paves the way towards de novo protein design. In particular, bioprocesses that 
utilize the remarkable properties of enzymes would often benefit from mutants that remain active at temperatures that are 
either higher or lower than the physiological temperature, while maintaining the biological activity. Many In silico methods 
have been developed in recent years for predicting the thermodynamic stability of mutant proteins, but very few have 
focused on thermostability. To bridge this gap, we developed an algorithm for predicting the best descriptor of 
thermostability, namely the melting temperature T,,,, from the protein's sequence and structure. Our method is applicable 
when the T,„ of proteins homologous to the target protein are known. It is based on the design of several temperature- 
dependent statistical potentials, derived from datasets consisting of either mesostable or thermostable proteins. Linear 
combinations of these potentials have been shown to yield an estimation of the protein folding free energies at low and 
high temperatures, and the difference of these energies, a prediction of the melting temperature. This particular 
construction, that distinguishes between the interactions that contribute more than others to the stability at high 
temperatures and those that are more stabilizing at low T, gives better performances compared to the standard approach 
based on T-independent potentials which predict the thermal resistance from the thermodynamic stability. Our method 
has been tested on 45 proteins of known T,,, that belong to 1 1 homologous families. The standard deviation between 
experimental and predicted T,„'s is equal to 13.6°C in cross validation, and decreases to 8.3°C if the 6 worst predicted 
proteins are excluded. Possible extensions of our approach are discussed. 
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introduction 

In the last decade there has been a growing attention on the 
study of the thermal stability of proteins and a lot of effort from 
both the theoretical and experimental sides have been devoted to 
understand its molecular basis. The potential applications are very 
broad and include the possibility to rationally modify the thermal 
stability of targeted proteins and hence optimize the bioprocesses 
in which they are involved [1-3]. This opens interesting 
perspectives in all academic and industrial sectors that exploit 
the unique properties of proteins, such as food industry, biofuel 
production, detergent industry, remediation of environmental 
pollutants, therapeutic approaches and drug design [4-6]. 

As a first step, it is quite important to gain theoretical 
understanding of the biophysical principles behind thermal 
stability. In a series of works [7-17] the mechanism and the 
interactions that promote or prevent thermal stabilization have 
been investigated. This is a highly non-trivial issue due to the large 
number of factors that influence the thermostability and to the 
marginal stabilization reached by the delicate balance between 
opposite energetic contributions. A series of factors has been 
indicated as responsible for the enhancement of the thermal 



resistance, based on the analysis of the amino acid conservation 
among the meso- and thermostable proteins belonging to the same 
homologous family. However, these factors are often not universal 
and family-dependent. 

More general investigations of the factors that influence the 
thermal resistance have been performed using free energy 
calculations with a continuum solvation model [18]. They have 
led to the idea that salt bridges promote hyperthermostability in 
proteins, whereas they make little contribution to protein stability 
at room temperature. This idea is supported by a lattice model 
which suggested that salt bridges contribute not only on the 
stabilization of the native states but also to the destabUization of 
the misfolded conformations [19]. Moreover, on the basis of 
temperature-dependent statistical potentials, it has been shown 
that not only salt bridges, but also cation-7t interactions, aromatic 
interactions, and hydrogen bonds between negatively charged and 
some aromatic residues tend to thermostabUize proteins, whereas 
hydrophobic packing appears to be neutral in this respect [20,21]. 

Several approaches have been devised for designing mutants 
that are more thermally stable than wild-type proteins. Experi- 
mental methods include directed evolution, sometimes coupled 
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with rational or semi-rational engineering strategies [22,23]; for a 
review see [24] and references tfierein. In silico engineering 
approaches have also been developed, which are based on residue 
conservation within homologous families, on structural and 
dynamical features, or on free energy calculations [25-29]. A 
sequence-based in silico method for predicting melting tempera- 
tures has been developed and applied to distinguish hyperthermo- 
philic from mesophilic microorganisms [30] . Even if these methods 
are partially successful, new, faster, more powerful and precise 
techniques would be welcome. 

It is noteworthy that a lot more computational methods have 
been developed to predict the thermodynamic stability of a protein 
- in particular the thermodynamic stability changes upon point 
mutations (for review of their performances, see [31-34]). These 
are often used to also predict thermal stability, although thermal 
and thermodynamic stability are only very imperfectly correlated. 
Indeed, the thermodynamic stability at a given temperature is 
defined by the folding free energy AG at that temperature, and the 
thermal stability by the melting temperature T„,. In Figure 1, one 
can find an example of the stability curves of two hypothetical 
proteins, one mesostable and the other thermostable, with 
approximately the same thermodynamic stability at room 
temperature (given by the AG* value) but with a significative 
difierence in thermal stability (given by AT„, = r*^'™ - T^'") of 
about 50°C. There is thus a need to develop efficient and fast 
thermal stability predictors, without detour through thermody- 
namic stability. 

The aim of this paper is to build an in silico method that directly 
predicts T„,, which is the best descriptor of thermal stability. For 
that purpose we have generalized and optimized the set-up 
introduced in [20,21] for defining temperature-dependent statis- 
tical potentials. This set-up was originally devised for distance 
potentials that describe tertiary interactions, based on propensities 
of residue pairs to be separated by a certain spatial distance. Here 
we apply it to also define temperature-dependent torsion 
potentials, which describe local interactions along the polypeptide 
chain and are based on propensities of residues to be associated 
with backbone torsion angle domains [35]. The main idea behind 
the construction is that, since thermodynamic and thermal stability 
are not always correlated, some new potentials that are defined at 



different temperatures and thus take into account the thermal 
properties of the intra-protein interactions have to be introduced 
besides the standard statistical potentials that are defined at an 
average temperature. This construction is illustrated in Figure 2. 
The practical implementation consists of building different 
datasets of proteins with known melting temperature and deriving 
statistical potentials from each of these; because of the limited 
amount of data only two sets were considered, a mesostable and a 
thermostable one. Since there are not enough experimentally 
resolved structures with known Ti„, we have enlarged the datasets 
by introducing some proteins with unknown T,,, but for which a 
crude estimation of T„, could be obtained from the environmental 
temperature of the host organism. This allowed us to derive 
smoother potentials and to obtain better performances. 

Once the potentials were derived, they were used to give a quite 
accurate prediction of the melting temperature of a target protein, 
using additional information about the T),, of homologous 
proteins. The overall flowchart of the method is summarized in 
Figure 3. Its performance was compared to that of the common 
procedure that uses temperature-independent potentials and 
hence predicts thermal resistance from thermodynamic stability. 

Methods 

Basic protein dataset S and homologous families 

To define temperature-dependent potentials, we used the 
protein dataset defined in [20] and denoted as S, which contains 
166 protein X-ray structures with resolution <2.5 A and known 
melting temperature T,,, measured for the transition from the 
monomeric state to the denatured state. They were collected from 
the literature and the ProTherm database [36], and manually 
checked on the basis of the original articles. If several T^-values 
were available for a given protein, we chose the at the pH 
condition closest to 7; if different T^'s were available at the same 
condition the average value was taken. In Table SO in File SI all 
the proteins belonging to this set and their characteristics are 
reported. 

In this dataset, 1 1 famihes consisting of at least three 
homologous proteins were identified, whose melting temperatures 
will be predicted later in this paper and compared to the 



AG(i<;J/moI) 
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Figure 1. Thermal versus thermodynamic stability. An example of the stability curves of an hypothetical couple of mesostable and 
thermostable proteins, characterized by an equal thermodynamic stability at room temperature, but different thermal stabilities. 
doi:1 0.1 371 /journal.pone.0091 659.g001 
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Figure 2. Folding free energies at different temperatures. Plot of the stability curve as a function of the temperature, and of the values of the 
three folding free energies AG^, AG* and AG*^ at the respective temperatures , T*, T^, for a hypothetical protein. 
doi:1 0.1 371 /journal.pone.0091 659.g002 



experimental melting temperatures. These are: a-amylase, lyso- 
zyme, myoglobin, j8-lactamase, ot-lactalbumin, acylphosphatase, 
adenylate kinase, cell 12A endoglucanase, cold shock protein, 
cytochrome P450 and ribonuclease. 

Enlarged, family-dependent, protein datasets Sf 

In view of constructing smoother potential.s and designing a T),,- 
predictor that is specific for the proteins belonging to a given 
family /, we have enlarged the basic dataset S. For each of the 1 1 
families /, in turn, additional proteins belonging to / were added 
to the dataset S so as to create the family-dependent dataset 



denoted as Sf. This procedure thus defines 1 1 different datasets 
Sf, one for each family. 

In contrast to the proteins from S, the T^'s of the additional 
proteins in Sf have not been characterized experimentally; only 
the environmental temperature of their host organism, T^ny, is 
known. This temperature refers to the optimal growth temperature 
for the micro- and cool-blooded organisms, while for the warm- 
blooded ones it is defined as the body temperature. The values of 
the Tei„ we are using (listed in Tables Sl-Sll in File SI) were 
manually checked from the literature. When no optimal growth 
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Figure 3. Flowchart of the T,,, prediction method for a protein /; belonging to the family /. 

doi:1 0.1 371/journal.pone.0091 659.g003 
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temperature was reported for a given microorganism, we took the 
mean of the range of temperatures over which it is able to grow. 

In order to obtain an estimation of the melting temperature of 
these additional proteins, three different methodologies were used. 
We would like to stress that these estimations do not pretend to 
yield a reliable prediction of the Tm, but they yield a rough 
approximation allowing us to decide if they belong to the set of 
thermostable or mesostable proteins, as explained later. 

The first two methods for estimating the T„,\ are based on the 
environmental temperature T^nv It is well known that and Tei„ 
are correlated, since thermophilic organisms necessarily host 
thermostable proteins (even if the converse is not true). Based on 
experimental data on families of homologous proteins, a correla- 
tion between and was indeed observed and the 
corresponding regression line was computed [38,39]. The 
regression line obtained in [39] is: 

7'«^'' = 0.62r<,„„+42.9°C. (1) 

The associated correlation coefficient, noted r*'' and computed 
without cross validation, is equal to 0.82. The T^^^'i derived with 
this formula are listed in Table SI— Sll in File SI. 

However this correlation was derived regardless of the type of 
proteins. One can expect that inside a given family of homologous 
proteins the correlation between and Te„^ is stronger due to the 
fact that the thermostability is in some way related to specific 
protein characteristics. We thus calculated the linear regression 
between Tm and Te„v inside each family, even though the number 
of proteins per family is small and the statistical significance of the 
correlation questionable. The estimated Tj^^'^'^^'s so obtained are 
listed in Tables S 1— S 1 1 in File S 1 and the regression lines for each 
family are given in Table S14 in File SI. The mean of the 
correlation coefficients r^^^ computed inside each family is equal to 
0.84 (without cross validation) and is thus almost equivalent to the 
correlation coefficient r*'* calculated on all families together. Note 
the peculiar case of the a-lactalbumin family (see Table S5 in File 
SI) for which the coefficients of the regression line are very 
different from the others. This family contains three proteins that 
bek)ng to three warm-blooded organisms with ver\' close Tgn^'s, 
[Homo sapiens 37°C, Bos taurus 38"C and Capra hircus 39°C) but T^'^ 
that differ by more than 30°C. The T„-Tg„y, regression line 
obtained from these proteins is thus probably not reliable. The 
regression line of the lysosyme family is also atypical, but to a lesser 
extent. 

The last method to estimate T„h is based on the sequence 
similarity fx^twcc'n the proteins. We assign as T,„ of a given protein 
the melting temperature of the protein of the same family that 
exhibits the highest sequence identity. This quite strong assump- 
tion is justified by the fact that, often, the higher the sequence 
identity, the higher the similarity among all structural, functional 
and thermodynamic characteristics, including thermostability. For 
that purpose, we performed pairwise alignments of all the 
sequences inside each family using the FASTA program [40]. 
The T^^'^'-'s. estimated on the basis of these results are reported in 
Tables Sl-Sll in File SI. 

Thermostable, mesostable and average protein datasets 

Sf, Sj and Sf" 

Each of the 1 1 family-dependent sets Sf was divided into two 
equal subsets: the mesostable ensemble Sj containing the proteins 
with (either known or estimated) T„ smaller than a certain 
threshold value and a thermostable set in which all proteins 



have Tm>T„. The threshold value T„ was determined in such a 
way that the two subsets contain an equal number of proteins; it 
thus shghtiy depends on /. 

Each suf)set was refined separately using the protein-culling 
server PISCES [37]. For each pair of proteins in a given subset 
that presents a sequence identity > 25%, only one protein was kept 
according to the following criteria: (1) when one protein has a 
known Tm while the other has an estimated T^ we chose the 
protein with known T^', (2) when both proteins have either an 
experimentally determined or an estimated T„,, we chose the 
one with highest T„j in the thermostable set and with lowest T„ in 
the mesostable set. This procedure prevents significant sequence 
similarity to occur inside each subset, which could bias the 
predictions. It also allows us to increase the difference between the 
average melting temperatures T„ of the meso- and thermostable 
subsets, so as to get more differentiated temperature-dependent 
potentials. 

We also constructed 1 1 family-dependent datasets from Sf. 
These sets were not split in two, but were refined using PISCES 
with the criterion that when two proteins (with both either known 
or estimated T,„) show a high degree of sequence identity (>25%), 
the protein with a melting temperature closest to the mean T^ is 
kept and the other is discarded. This rule is not applied when one 
protein has an estimated T,„ and the other a known T,„', in such 
case the protein with known is kept and the protein with 
estimated is discarded. 

This procedure yields, for each of the 1 1 families /, three 
protein datasets, a mesostable set Sj, a thermostable set 5^, and 
an average set S^. Each of these sets is characterized by T„, 
defined as the average of the melting temperatures of the proteins 
belonging to the set. This average temperature depends on the 
considered family. The dependence is, however, very small, and 
we will for the simplicity of the notations not add a subscript / to 
Tm. The values of the 7^'s associated to the different datasets are 
given in Table SI 3 in File SI. 

Stastlstlcal potentials 

Temperature- and family-dependent statistical potentials were 
derived from the datasets which are each character- 

ized by a different average melting temperature T^. This is done 
using the Boltzmann law, following [20,21]: 



AWf(s,c,T„)^-kT In 



F(s,c,f,„) 
F(s,T„,)F(c,T,„y 



(2) 



where s represent single amino acids or amino acid pairs, and c 
spatial distances between residue pairs or backbone torsion angle 
domains; F represent relative frequencies computed in the dataset 
of average melting temperature Tffi^ 1.6. 
F(s,c, Tm) = n(s,c, Tm)/ n(Tm). 

In particular, we built two distance potentials and two torsion 
potentials. In the torsion potentials, .s correspond either to the 
amino acid type a,- of residue ; or to the amino acid types (ai,a/) of 
residues i and j, and c corresponds to the backbone torsion angle 
domain tk of residue k. Seven (^,i/',(b) torsion angle domains were 
used, defined in [41]. These potentials describe local interactions 
along the chain: i<j and ije{k — 8,k + S}. They are denoted as 
AW{a,t,T„) and AW{a,a',t,Tm). 

In the two distance potentiads, the structure motif c is the spatial 
distance dg between the residues / and J, with j>i+l. In 
AlV(a,a' ,d,Tm), residues / and j are of type a and a'. In 
AW(a,d,T„), residue / or J is of type a and the other is of arbitrary 
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type. We defined the distance between two residues as the distance 
between the geometrical center of the heavy side-chain atoms [20]. 
The distance values between 3.0 and 8.0 A were grouped into 
25 bins of 0.2 A width; two additional bins describe distances 
larger than 8.0 A and smaller than 3.0 A, respectively. Moreover, 
we used a trick to artificially increase the number of occurrences in 
each bin and thereby smooth the potential. We summed the 
occurrences of neighboring bins, giving them a decreasing weight: 



- + - 



+ ...«' + 



(3) 



where n' represents the number of occurrences n{c,s,T„) or 
n{c,Tm) in bin /, and I is set equal to 3; n{s,T„) and n{Tm) are 
normalized consequendy. 

In order to deal with the limited size of the datasets, a correction 
for sparse data [35] is applied: 



F(s,c,T„)- 



gF(s,T^)Fic,T„)+n'Fis,c 



(4) 



where the expected numiicr of occurrences is 
rf = n{s,T„)n{c,T„)/n(T„), and g an adjustable parameter. This 
correction ensures that the potentials are close to 0 when the 
number of observations in the dataset is too small. The value of c: 
was chosen to be equal to either 10 or 20. 

We computed all the statistical torsion and distance potentials 
AWf{s,c,T„) using the two values of q and the three difiFerent 
procedures for estimating T„ from Te„v, described in the previous 
subsections. This yields six different series of APF/(.y,c,T'„,)'s. The 
final torsion and distance potentials that we consider in the 
following correspond to the average of these six potentials. 

Prediction of the melting temperature Tm 

The folding free energ)' AG at some temperature referred to as 
T„ of a protein p that belongs to the family / is evaluated by a 
finear combination of the four torsion and distance potentials 
defined in Eq. (2), which are derived from the sets of proteins (S^, 
Sj and 5^) of average melting temperature T„: 



AGpef{T„)= jjr- 



boiTm) AWf{ai,aj,dij,T„) 



+ biiT„) ^ KWf{ai,dij,T„ 



(5) 



ij,k=i 



where / # jj + 1 for the distance potentials, A: — 8 < / <J < A: + 8 for 
the torsion potentials, A// is a family dependent normalization 
factor, and Np is the number of residues of p. Let us for simplicity 
denote as AG^, AGj and AG^ the family- and T-dependent 
folding free energies of protein p belonging to / computed using 
the statistical potentiels derived from the sets ^y, Sj and 5^, 
respectively. 



We predict the melting temperature on the basis of these 
potentials in two different ways. In the first, we assume that the 
melting temperature is proportional to the average folding free 
energy AC^. This is the common procedure that predicts thermal 
from thermodynamic stability. In the second, original, method, we 
assume that the melting temperature is proportional to the 
difference in folding free energy at two different temperatures: 
[AG'^ — AG^]. In these two procedures, the parameters, generi- 
cally denoted as P, are optimized so as to minimize the standard 
deviation between the predicted and experimental melting 
temperatures of the ensemble of considered proteins; we use for 
that purpose the minimization function implemented in Mathema- 
tica 7. More precisely: 



= arg min[ V (c^^ [aG^ - AGjl + d^^ - T^^f] , 

pAV - -I 



: arg mm 
pO p 



AG'^l+d'-^-T^^pY 



(6) 



where P^^ = (h^,hl,bf,bJ,b^,bl,bl,bJj^f,c^'',d^'') and 
P*^ = (b^,bf,b2,b^J^f,c'^,d'^); the sum over p in these expressions 
means the sum over all the proteins with known melting 
temperature T^,p that belong to the 11 homologous families. 
The coefficients (c^^,c*) and (d^^,d^) give, respectively, the slope 
and the intercepts of the regression line between computed folding 
free energies and experimental melting temperatures that best fit 
the data. 

In order to a\()id overestimating the performance of our 
method, we performed cross validation using the jack-knife 
technique: the parameters are identified on all proteins but one, 
which is used as test protein; every protein in turn is considered as 
test protein, and the average score is considered. 

Results 

The contributions of amino acid interactions to protein stability 
are known to be temperature-dependent; some may be more 
stabilizing than others in the high temperature regime and less 
stabilizing than others at low T, or conversely [18,20,21,42,43]. 
Such dependence need to be taken into account for a proper 
analysis of thermal stability properties. For that purpose, we 
created different datasc-ts of proteins with known melting 
temperatures: in sets only mesostable proteins were considered, 
in sets all entries are thermostable, and in 5"^ sets all proteins 
were taken independently of their T,,,. Each ensemble has been 
associated with a temperature T„ computed as the mean of the T„ 
values of the proteins belonging to the set. 

Predicting the melting temperature of a protein from its 
structure alone is quite a difficult task, and we therefore focus on 
the slightly simpler problem of predicting this temperature using 
information from homologous proteins. We hence selected 1 1 
families of proteins of known T^, labelled by /, and defined 1 1 
triplets of sets S^,Sj,Sj^, by adding proteins belonging to the 
family to the complete set S, following the procedure explained in 
the Methods section. 

From each of these datasets characterized by an average melting 
temperature T„, two torsion potentials and two distance potentials 
have been derived using the standard statistical-potential formal- 
ism that converts the relative amino acid frequencies into free 
energy trough the Boltzmann law (Eq.(2)). The torsion potentials 
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are based on the propensities of single amino acids and amino acid 
pairs to adopt some backbone torsion angles and describe local 
interactions along the chain. The distance potentials describe 
tertiary interactions and are computed from propensities of amino 
acid pairs to be separated by a certain spatial distance. The total 
folding free energy AG at some temperature T,„ is explicidy 
computed as a linear combination of these different statistical 
potentials, derived from the dataset associated with T„, (Eq.(5)). 
We hence obtain, for each protein p, three folding free energies 
AG^, AGJ and AG*; the coefficients of the combination are 
parameters that are fixed in a further step. In Figure 2 these three 
folding free energies at different temperatures T^, and 7"* are 
depicted on the stability curve of a hypothetical protein. 

Two procedures are used to predict the from these free 
energies. The first assumes a linear correlation between and 
AG*^, which is the standard way of predicting melting tempera- 
tures. The second, novel, procedure consists of assuming a linear 
correlation between and [AG-^-AG^]. In the last step, the 
parameters [i.e. the coefficients of the linear combination of 
statistical potentials) were identified so as to minimize the 
difference between the computed and experimental Tn/s, (Eq.(6)). 
To avoid an overestimation of the performance, we systematically 
performed cross validations using the jack-knife technique as 
explained in the Methods section. 

The first procedure, which assumes a correlation between T,„ 
and AG*, is justified by the fact that the thermodynamic and 
thermal stabilities are sometimes related, even if this is obviously 
not always true. Indeed, in the language of [44] (for a more recent 
review see also [45]), one way for the protein to enhance its 
thermostability is to increase its thermodynamic stability at all 
temperatures, thereby shifting the entire stability curve "down- 
wards", i.e. towards lower AG's. The other two ways to increase 
thermal resistance, namely a decrease of the heat capacity change 
ACp that brings a modification of the shape of the curve and a 
global shift of the curve towards the high temperature region, are 
instead better captured by the second procedure, which assumes a 
correlation between and the difiference between the folding 
free energy at dififerent temperatures, i.e. [AG'^ — AG^j. 

The results of the Tm predictions for aU proteins of our dataset 
are plotted in Figure 4. Figure 4. a shows the correlation between 
the experimental melting temperature and the temperature 
predicted from the folding free energy difference [AG^ — AGjj. 
The associated linear correlation coefficient is equal to 0.68 (P- 
value 10^^). Figure 4.b shows instead the correlation between the 
experimental T^'s and the r^'s predicted from the average 
potential AGj^. The corresponding linear correlation coefficient is 



very low: r = 0.15 and is not statistically significant (P-value 0.3). 
Clearly, the new procedure presented here, which predicts melting 
temperatures from [AG^ — AGj] using 7"-depeiident statistical 
potentials, is much superior to the common procedure that 
predicts T),, from AG* using simple T-independent potentials. 

Focusing on the [AG^ — AGjj-based method, we analyze 
whether some proteins are better predicted than others, and 
whether badly predicted proteins cause a significant decrease of 
the overall performance. In Figure 4.c, the 6 proteins that are 
predicted worst are excluded. To identify these proteins, we 
excluded at each step the protein whose melting temperature is 
predicted worst and we recompute the T^'i of the remaining 
proteins. We repeat the procedure until 6 proteins are excluded. In 
this case the linear correlation coefficient rises up to 0.83 (P-value 
<10-'<'). 

The standard deviations a between the predicted and experi- 
mental values of the melting temperatures, computed for each 
family individually, are reported in Table 1 ; the results per protein 
are given in Table S12 in File SI. On average, a^^ is equal to 
IS.G^C when computed on the basis of the free energy difference 
[AG^ — AGjj. This is significantly better than the average cr-value 
computed with the standard AG*-based method, which yields 
ff*^17.6''C. Moreover, removing the 6 worst predicted proteins 
reduces g^" from 13.6 to S.SOC. For comparison, we added in the 
Table the results obtained in direct validation, which yield a a^^ of 
S.S^C. 

The best predicted families are acylphosphatase, a-amylase and 
/^-lactamase, with cr'^^-values between 5.9 and 7.5''C, while the 
worst are cytochrome P450 and myoglobin, with u'^^-values 
around 19''C. The proteins from the latter two families contain a 
heme, whereas the proteins from the other families contain no 
ligands or very small ones (see Tables Sl-Sl 1 in File SI). As our 
statistical potentials do not take into account the interactions with 
the ligands, mutations in the region of the heme are necessarily not 
estimated properly. The presence of the heme could thus well be 
the reason for the poor predictions in the cytochrome P450 and 
myoglobin families. 

The average T,„ prediction score obtained with the standard, 
AG*-based, method is significantly lower than the one that uses 
[AG'^ — AG^j. It is however noteworthy that some families are 
better predicted with the former method. This is clearly the case 
for the endoglucanase family and to a lower extent for the 
lysozyme family. This result suggests that these proteins are 
thermally stabilized through a shift of the entire stability curve 
towards lower AG-values. 




Figure 4. Melting temperature prediction. Relation between the experimental melting temperature Tf^'^ and the predicted temperatures: (a) 
is computed from the folding free energy difference [AG^ — AGj] (correlation coefficient (■^'^ = 0.68), (b) T* from the folding free energy AG* 
(r* = 0.15), and (c) T^^* from [AG^-AGj] excluding the 6 proteins that are predicted worst (f^-^* = 0.83). 
doi:1 0.1 371 /journal.pone.0091 659.g004 
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Table 1. Values of the standard deviations a^^ and tr* between the measured and the predicted melting temperatures (in 
degrees); a^^* means the standard deviation excluding the 6 proteins whose r,„ is predicted worst; A' indicates the number of 
proteins in the family. 





Family 


L J 




|AG*-AG^| 


|AG*-AG^| 






jack knife 


jack knife 


jack knife 


no jack knife 
















Acylphosphatase 


7.5 


25.2 


4.7 


3.0 


3 


Ribonuclease 


17.3 


23.0 


3.5 


2.7 


5 


Lysozyme 


15.0 


13.2 


8.1 


4.2 


4 


Cell 12A endoglucanase 


13.7 


9.6 


5.4 


4.4 


5 


Adenylate kinase 


12.1 


15.2 


3.3 


9.4 


6 


^-Amylase 


7.5 


9.5 


7.6 


4.2 


4 


c(-Lactalbumin 


17.6 


21.0 


15.9 


6.9 


3 


Myoglobin 


19.9 


19.4 


15.4 


7.8 


3 


Cytochrome P450 


18.6 


21.8 


10.7 


12.0 


5 


^-Lactamase 


5.9 


20.1 


7.1 


2.7 


4 


Cold shock 


14.4 


14.9 


10.2 


3.8 


3 


Average 


13.6 


17.6 


8.3 


5.5 





doi:l 0.1 371 /journal.pone.0091 659.t001 



Discussion 

A complete understanding of the features that determine 
protein thermal stability is still far from being reached. We have 
however made some progress towards this goal. The originality of 
our approach lies in the use of temperature-dependent statistical 
potentials, derived from distinct sets of protein structures, 
containing either mesostable or thermostable proteins. Linear 
combinations of these meso- and thermostable potentials, with 
coefficients identified so as to minimize the standard deviation 
between experimental and predicted T^'s, were used to predict 
the melting temperature on a set of 45 proteins that belong to 1 1 
different homologous families. 

These potentials allowed us to determine in an objective way the 
interactions that contribute most to protein stability in different 
temperature ranges and also, interestingly, the interactions that are 
less destabilizing - in other words, less repulsive - according to the 
temperature. For example, the temperature-dependent distance 
potentials point salt bridges, cation-7t and aromatic interactions to 
contribute more to stability at high temperatures than hydropho- 
bic packing, and conversely, and the interactions between 
positively charged residues to be less repulsive at high than at 
low temperature relative to other interactions [20,21]. 

The novel temperature-dependent torsion potentials introduced 
here show also a significant dependence on the temperature. They 
provide indeed a non-negligible improvement of the T,„ prediction 
performance. However, they are much more difficult to interpret 
in terms of specific interactions than distance potentials. Indeed, 
they reflect the propensities of amino acids and amino acid pairs to 
be associated to backbone torsion angle domains in their vicinity 
along the polypeptide chain, up to eight sequence positions 
further. These propensities are obviously related to secondary 
structure preferences but in an intricate way. 

Another important feature that ensures the success of our 
approach is the focus on families of homologous proteins. We 
indeed defined family- and temperature-dependent statistical 
potentials, that include more proteins of the family under 



consideration and hence bias the potentials towards it. Note that 
we nevertheless kept the pairwise sequence similarity in the set to 
be at most 25%, to avoid uncontrolled biases. As the number of 
proteins with known T),, is quite limited, we also used proteins of 
unknown T„, but of known T^,,, to enlarge the datasets from which 
potentials are derived, using three different rules to roughly 
estimate the former from the latter. 

Note that the same approach as the one proposed here can be 
used for general predictions, independently of protein families. 
However, this - as expected — decreases significantiy the score of 
the predictions. On the other hand, we would like to emphasize 
that our method predicts the T,„ of a given protein from the T„, of 
homologous proteins, which have sometimes very different 
sequences. A much easier goal would be to predict the change 
in melting temperature upon point mutations (AT),,). 

The results presented here are very encouraging, but severely 
suffer from lack of data. Indeed, the number of proteins with 
experimentally determined structure and melting temperature is 
too limited, both for deriving sufificiendy reliable temperature- 
dependent statistical potentials, and for biasing them properly 
towards a given protein family. The comparison of the score 
obtained in cross validation (cr^^ = I3.6°C between predicted and 
measured T„,'s) with the score in direct validation (ff'^^ = 5.5°C) 
indicates that improvement can be expected from an increased 
dataset. Another source of errors is due to the fact that some 
families contain ligands, such as the hemes for the myoglobin and 
cytochrome families. These ligands sometimes strongly affect the 
stabilization properties of the proteins but cannot be taken into 
account in our potentials, which are limited to the residues of tiie 
polypeptide chain. This inevitably brings up the value of a. Finally, 
some experimental error should be included in the evaluation. 
This involves the intrinsic experimental error but, more impor- 
tantiy, the fact that the available experimental data are sometimes 
not performed exactly in the same experimental conditions in 
terms of pH, ionic strength, etc. 

This discussion allows us to conclude on a positive note: the 
performance of our method is already quite good but is expected 
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to significantly improve when larger datasets of proteins with 
known 7"^, obtained in identical experimental conditions, will be 
available. 

Supporting Information 

File SI Table SO, List of proteins with known melting 
temperature used in this study. Table Sl-Sll, List of proteins 
with known Tm or Tenv belonging to the 1 1 homologous families. 
Table SI 2, Experimental and predicted T^'s of the proteins that 
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